{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ULdrhOaVbsdO"
      },
      "source": [
        "# Acme: Tutorial\n",
        "\n",
        "\u003ca href=\"https://colab.research.google.com/github/deepmind/acme/blob/master/examples/tutorial.ipynb\"\u003e\u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\"\u003e\u003c/a\u003e\n",
        "\n",
        "This colab provides an overview of how Acme's modules can be stacked together to create reinforcement learning agents. It shows how to fit networks to environment specs, create actors, learners, replay buffers, datasets, adders, and full agents. It also highlights where you can swap out certain modules to create your own Acme based agents. "
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "xaJxoatMhJ71"
      },
      "source": [
        "## Installation\n",
        "\n",
        "In the first few cells we'll start by installing all of the necessary dependencies (and a few optional ones)."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "KH3O0zcXUeun"
      },
      "outputs": [],
      "source": [
        "#@title Install necessary dependencies.\n",
        "\n",
        "!sudo apt-get install -y xvfb ffmpeg\n",
        "!pip install 'gym==0.10.11'\n",
        "!pip install imageio\n",
        "!pip install PILLOW\n",
        "!pip install 'pyglet==1.3.2'\n",
        "!pip install pyvirtualdisplay\n",
        "\n",
        "!pip install dm-acme\n",
        "!pip install dm-acme[reverb]\n",
        "!pip install dm-acme[tf]\n",
        "!pip install dm-acme[envs]\n",
        "\n",
        "from IPython.display import clear_output\n",
        "clear_output()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "VEEj3Qw60y73"
      },
      "source": [
        "### Install dm_control\n",
        "\n",
        "The next cell will install environments provided by `dm_control` _if_ you have an institutional MuJoCo license. This is not necessary, but without this you won't be able to use the `dm_cartpole` environment below and can instead follow this colab using `gym` environments. To do so simply expand the following cell, paste in your license file, and run the cell.\n",
        "\n",
        "Alternatively, Colab supports using a Jupyter kernel on your local machine which can be accomplished by following the guidelines here: https://research.google.com/colaboratory/local-runtimes.html. This will allow you to install `dm_control` by following instructions in https://github.com/deepmind/dm_control and using a personal MuJoCo license.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "both",
        "colab": {},
        "colab_type": "code",
        "id": "IbZxYDxzoz5R"
      },
      "outputs": [],
      "source": [
        "#@title Add your License\n",
        "#@test {\"skip\": true}\n",
        "mjkey = \"\"\"\n",
        "\"\"\".strip()\n",
        "\n",
        "mujoco_dir = \"$HOME/.mujoco\"\n",
        "\n",
        "# Install OpenGL dependencies\n",
        "!apt-get update \u0026\u0026 apt-get install -y --no-install-recommends \\\n",
        "  libgl1-mesa-glx libosmesa6 libglew2.0\n",
        "\n",
        "# Get MuJoCo binaries\n",
        "!wget -q https://www.roboti.us/download/mujoco200_linux.zip -O mujoco.zip\n",
        "!unzip -o -q mujoco.zip -d \"$mujoco_dir\"\n",
        "\n",
        "# Copy over MuJoCo license\n",
        "!echo \"$mjkey\" \u003e \"$mujoco_dir/mjkey.txt\"\n",
        "\n",
        "# Install dm_control\n",
        "!pip install dm_control\n",
        "\n",
        "# Configure dm_control to use the OSMesa rendering backend\n",
        "%env MUJOCO_GL=osmesa\n",
        "\n",
        "# Check that the installation succeeded\n",
        "try:\n",
        "  from dm_control import suite\n",
        "  env = suite.load('cartpole', 'swingup')\n",
        "  pixels = env.physics.render()\n",
        "except Exception as e:\n",
        "  raise e from RuntimeError(\n",
        "      'Something went wrong during installation. Check the shell output above '\n",
        "      'for more information.')\n",
        "else:\n",
        "  from IPython.display import clear_output\n",
        "  clear_output()\n",
        "  del suite, env, pixels"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "c-H2d6UZi7Sf"
      },
      "source": [
        "## Import Modules\n",
        "\n",
        "Now we can import all the relevant modules."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "both",
        "colab": {},
        "colab_type": "code",
        "id": "HJ74Id-8MERq"
      },
      "outputs": [],
      "source": [
        "#@title Import modules.\n",
        "#python3\n",
        "\n",
        "%%capture\n",
        "import copy\n",
        "import pyvirtualdisplay\n",
        "import imageio \n",
        "import base64\n",
        "import IPython\n",
        "\n",
        "\n",
        "from acme import environment_loop\n",
        "from acme import networks\n",
        "from acme.adders import reverb as adders\n",
        "from acme.agents import actors_tf2 as actors\n",
        "from acme.datasets import reverb as datasets\n",
        "from acme.wrappers import gym_wrapper\n",
        "from acme import specs\n",
        "from acme import wrappers\n",
        "from acme.agents import d4pg\n",
        "from acme.agents import agent\n",
        "from acme.utils import tf2_utils\n",
        "from acme.utils import loggers\n",
        "\n",
        "import gym \n",
        "import dm_env\n",
        "import matplotlib.pyplot as plt\n",
        "import numpy as np\n",
        "import reverb\n",
        "import sonnet as snt\n",
        "import tensorflow as tf\n",
        "\n",
        "# Import dm_control if it exists.\n",
        "try:\n",
        "  from dm_control import suite\n",
        "except ModuleNotFoundError, OSError:\n",
        "  pass\n",
        "\n",
        "# Set up a virtual display for rendering OpenAI gym environments.\n",
        "display = pyvirtualdisplay.Display(visible=0, size=(1400, 900)).start()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "I6KuVGSk4uc9"
      },
      "source": [
        "## Load an environment\n",
        "\n",
        "We can now load an environment. In what follows we'll create an environment in order to generate and visualize a single state from that environment. Just select the environment you want to use and run the cell."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "form",
        "colab": {},
        "colab_type": "code",
        "id": "4PVlHtGF5yzt"
      },
      "outputs": [],
      "source": [
        "environment_name = 'dm_cartpole'  # @param ['dm_cartpole', 'gym_mountaincar']\n",
        "# task_name = 'balance'  # @param ['swingup', 'balance']\n",
        "\n",
        "def make_environment(domain_name='cartpole', task='balance'):\n",
        "  env = suite.load(domain_name, task)\n",
        "  env = wrappers.SinglePrecisionWrapper(env)\n",
        "  return env\n",
        "\n",
        "if 'dm_cartpole' in environment_name:\n",
        "  environment = make_environment('cartpole')\n",
        "  def render(env):\n",
        "    return env._physics.render(camera_id=0)  #pylint: disable=protected-access\n",
        "\n",
        "elif 'gym_mountaincar' in environment_name:\n",
        "  environment = gym_wrapper.GymWrapper(gym.make('MountainCarContinuous-v0'))\n",
        "  environment = wrappers.SinglePrecisionWrapper(environment)\n",
        "  def render(env):\n",
        "    return env.environment.render(mode='rgb_array')\n",
        "else:\n",
        "  raise ValueError('Unknown environment: {}.'.format(environment_name))\n",
        "\n",
        "# Show the frame.\n",
        "frame = render(environment)\n",
        "plt.imshow(frame)\n",
        "plt.axis('off')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "dQJprnmn41fP"
      },
      "source": [
        "### Environment Spec\n",
        "\n",
        "We will later interact with the environment in a loop corresponding to the following diagram: \n",
        "\n",
        "\u003cimg src=\"https://github.com/deepmind/acme/raw/master/docs/diagrams/environment_loop.png\" width=\"500\" /\u003e\n",
        "\n",
        "But before we start building an agent to interact with this environment, let's first look at the types of objects the environment either returns (e.g. observations) or consumes (e.g. actions). The `environment_spec` will show you the form of the *observations*, *rewards* and *discounts* that the environment exposes and the form of the *actions* that can be taken.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "lph6pbmu7UfZ"
      },
      "outputs": [],
      "source": [
        "environment_spec = specs.make_environment_spec(environment)\n",
        "\n",
        "print('actions:\\n', environment_spec.actions, '\\n')\n",
        "print('observations:\\n', environment_spec.observations, '\\n')\n",
        "print('rewards:\\n', environment_spec.rewards, '\\n')\n",
        "print('discounts:\\n', environment_spec.discounts, '\\n')"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "M0h2xojW9wT2"
      },
      "source": [
        "## Build a policy network that maps observations to actions. \n",
        "\n",
        "The most important part of a reinforcement learning algorithm is potentially the policy that maps environment observations to actions. We can use a simple neural network to create a policy, in this case a simple feedforward MLP with layer norm. For our TensorFlow agents we make use of the `sonnet` library to specify networks or modules; all of the networks we will work with also have an initial batch dimension which allows for batched inference/learning.\n",
        "\n",
        "It is possible that the the observations returned by the environment are nested in some way: e.g. environments from the `dm_control` suite are frequently returned as dictionaries containing `position` and `velocity` entries. Our network is allowed to arbitarily map this dictionary to produce an action, but in this case we will simply concatenate these observations before feeding it through the MLP. We can do so using Acme's `batch_concat` utility to flatten the nested observation into a single dimension for each batch. If the observation is already flat this will be a no-op.\n",
        "\n",
        "Similarly, the output of the MLP is may have a different range of values than the action spec dictates. For this, we can use Acme's `TanhToSpec` module to rescale our actions to meet the spec."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "JZKWAFEQz5NP"
      },
      "outputs": [],
      "source": [
        "# Calculate how big the last layer should be based on total # of actions.\n",
        "action_spec = environment_spec.actions\n",
        "action_size = np.prod(action_spec.shape, dtype=int)\n",
        "exploration_sigma = 0.3\n",
        "\n",
        "# In order the following modules:\n",
        "# 1. Flatten the observations to be [B, ...] where B is the batch dimension.\n",
        "# 2. Define a simple MLP which is the guts of this policy.\n",
        "# 3. Make sure the output action matches the spec of the actions.\n",
        "policy_modules = [\n",
        "    tf2_utils.batch_concat,\n",
        "    networks.LayerNormMLP(layer_sizes=(300, 200, action_size)),\n",
        "    networks.TanhToSpec(spec=environment_spec.actions)]\n",
        "\n",
        "policy_network = snt.Sequential(policy_modules)\n",
        "\n",
        "# We will also create a version of this policy that uses exploratory noise.\n",
        "behavior_network = snt.Sequential(\n",
        "    policy_modules + [networks.ClippedGaussian(exploration_sigma),\n",
        "                      networks.ClipToSpec(action_spec)])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "9FbmoOpZKwid"
      },
      "source": [
        "## Create an actor\n",
        "\n",
        "An `Actor` is the part of our framework that directly interacts with an environment by generating actions. In more detail the earlier diagram can be expanded to show exactly how this interaction occurs:\n",
        "\n",
        "\u003cimg src=\"https://github.com/deepmind/acme/raw/master/docs/diagrams/actor_loop.png\" width=\"500\" /\u003e\n",
        "\n",
        "While you can always write your own actor, in Acme we also provide a number of useful premade versions. For the network we specified above we will make use of a `FeedForwardActor` that wraps a single feed forward network and knows how to do things like handle any batch dimensions or record observed transitions."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "rpTs49OWKvMV"
      },
      "outputs": [],
      "source": [
        "actor = actors.FeedForwardActor(policy_network)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "oMxSAzDWYhC4"
      },
      "source": [
        "All actors have the following public methods and attributes:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "iKoARV3gYfcV"
      },
      "outputs": [],
      "source": [
        "[method_or_attr for method_or_attr in dir(actor)  # pylint: disable=expression-not-assigned\n",
        " if not method_or_attr.startswith('_')]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "3qrb2ZGhAoR5"
      },
      "source": [
        "## Evaluate the random actor's policy.\n",
        "\n",
        "\n",
        "Although we have instantiated an actor with a policy, the policy has not yet learned to achieve any task reward, and is essentially just acting randomly. However this is a perfect opportunity to see how the actor and environment interact. Below we define a simple helper function to display a video given frames from this interaction, and we show 500 steps of the actor taking actions in the world."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "OIJRbtAlxQVu"
      },
      "outputs": [],
      "source": [
        "def display_video(frames, filename='temp.mp4'):\n",
        "  \"\"\"Save and display video.\"\"\"\n",
        "  # Write video\n",
        "  with imageio.get_writer(filename, fps=60) as video:\n",
        "    for frame in frames:\n",
        "      video.append_data(frame)\n",
        "  # Read video and display the video\n",
        "  video = open(filename, 'rb').read()\n",
        "  b64_video = base64.b64encode(video)\n",
        "  video_tag = ('\u003cvideo  width=\"320\" height=\"240\" controls alt=\"test\" '\n",
        "               'src=\"data:video/mp4;base64,{0}\"\u003e').format(b64_video.decode())\n",
        "  return IPython.display.HTML(video_tag)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "wdCeuvHeUwwm"
      },
      "outputs": [],
      "source": [
        "# Run the actor in the environment for desired number of steps.\n",
        "frames = []\n",
        "num_steps = 500\n",
        "timestep = environment.reset()\n",
        "\n",
        "for _ in range(num_steps):\n",
        "  frames.append(render(environment))\n",
        "  action = actor.select_action(timestep.observation)\n",
        "  timestep = environment.step(action)\n",
        "\n",
        "# Save video of the behaviour.\n",
        "display_video(np.array(frames))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "iz_OYw60MHmc"
      },
      "source": [
        "## Storing actor experiences in a replay buffer\n",
        "\n",
        "Many RL agents utilize a data structure such as a replay buffer to store data from the environment (e.g. observations) along with actions taken by the actor. This data will later be fed into a learning process in order to update the policy. Again we can expand our earlier diagram to include this step:\n",
        "\n",
        "\u003cimg src=\"https://github.com/deepmind/acme/raw/master/docs/diagrams/batch_loop.png\" width=\"500\" /\u003e\n",
        "\n",
        "In order to make this possible, Acme leverages [Reverb](https://github.com/deepmind/reverb) which is an efficient and easy-to-use data storage and transport system designed for Machine Learning research. Below we will create the replay buffer before interacting with it."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "uQ0NAMwGWHu6"
      },
      "outputs": [],
      "source": [
        "# Create a table with the following attributes:\n",
        "# 1. when replay is full we remove the oldest entries first.\n",
        "# 2. to sample from replay we will do so uniformly at random.\n",
        "# 3. before allowing sampling to proceed we make sure there is at least\n",
        "#    one sample in the replay table.\n",
        "# 4. we use a default table name so we don't have to repeat it many times below;\n",
        "#    if we left this off we'd need to feed it into adders/actors/etc. below.\n",
        "replay_buffer = reverb.Table(\n",
        "    name=adders.DEFAULT_PRIORITY_TABLE,\n",
        "    max_size=1000000,\n",
        "    remover=reverb.selectors.Fifo(),\n",
        "    sampler=reverb.selectors.Uniform(),\n",
        "    rate_limiter=reverb.rate_limiters.MinSize(min_size_to_sample=1))\n",
        "\n",
        "# Get the server and address so we can give it to the modules such as our actor\n",
        "# that will interact with the replay buffer.\n",
        "replay_server = reverb.Server([replay_buffer], port=None)\n",
        "replay_server_address = 'localhost:%d' % replay_server.port"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "joJaFWKjNep-"
      },
      "source": [
        "We could interact directly with Reverb in order to add data to replay. However in Acme we have an additional layer on top of this data-storage that allows us to use the same interface no matter what kind of data we are inserting. \n",
        "\n",
        "This layer in Acme corresponds to an `Adder` which adds experience to a data table. We provide several adders that differ based on the type of information that is desired to be stored in the table, however in this case we will make use of an `NStepTransitionAdder` which stores simple transitions (if N=1) or accumulates N-steps to form an aggregated transition."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "9CozET1ENd_j"
      },
      "outputs": [],
      "source": [
        "# Create a 5-step transition adder where in between those steps a discount of\n",
        "# 0.99 is used (which should be the same discount used for learning).\n",
        "adder = adders.NStepTransitionAdder(\n",
        "    client=reverb.Client(replay_server_address),\n",
        "    n_step=5,\n",
        "    discount=0.99)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "rMOgmNpsiyC5"
      },
      "source": [
        "We can either directly use the adder to add transitions to replay directly using the `add()` and `add_first()` methods as follows:"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "both",
        "colab": {},
        "colab_type": "code",
        "id": "ia16FN4dPWk4"
      },
      "outputs": [],
      "source": [
        "num_episodes = 2  #@param\n",
        "\n",
        "for episode in range(num_episodes):\n",
        "  timestep = environment.reset()\n",
        "  adder.add_first(timestep)\n",
        "\n",
        "  while not timestep.last():\n",
        "    action = actor.select_action(timestep.observation)\n",
        "    timestep = environment.step(action)\n",
        "    adder.add(action=action, next_timestep=timestep)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "vxNHy-h2Wl9q"
      },
      "source": [
        "Since this is a common enough way to observe data, `Actor`s in Acme generally take an `Adder` instance that they use to define their observation methods. We saw earlier that the `FeedForwardActor` like all `Actor`s defines `observe` and `observe_first` methods. If we give the actor an `Adder` instance at init then it will use this adder to make observations."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "Xp-YYHaHWQg_"
      },
      "outputs": [],
      "source": [
        "actor = actors.FeedForwardActor(policy_network=behavior_network, adder=adder)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ertrRdjOZHZ3"
      },
      "source": [
        "Below we repeat the same process, but using `actor` and its `observe` methods. We note these subtle changes below."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "both",
        "colab": {},
        "colab_type": "code",
        "id": "rJ0_UCLeZPOt"
      },
      "outputs": [],
      "source": [
        "num_episodes = 2  #@param\n",
        "\n",
        "for episode in range(num_episodes):\n",
        "  timestep = environment.reset()\n",
        "  actor.observe_first(timestep)  # Note: observe_first.\n",
        "\n",
        "  while not timestep.last():\n",
        "    action = actor.select_action(timestep.observation)\n",
        "    timestep = environment.step(action)\n",
        "    actor.observe(action=action, next_timestep=timestep)  # Note: observe."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "OVj1cBw8epeh"
      },
      "source": [
        "## Learning from experiences in replay\n",
        "Acme provides multiple learning algorithms/agents.  Here, we will use the Acme's D4PG learning algorithm to learn from the data collected by the actor. To do so, we first create a TensorFlow dataset from the Reverb table using the `make_dataset` function. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "IbDfFYTRqmoz"
      },
      "outputs": [],
      "source": [
        "# This connects to the created reverb server; also note that we use a transition\n",
        "# adder above so we'll tell the dataset function that so that it knows the type\n",
        "# of data that's coming out.\n",
        "dataset = datasets.make_dataset(\n",
        "    client=reverb.TFClient(server_address=replay_server_address),\n",
        "    batch_size=256,\n",
        "    environment_spec=environment_spec,\n",
        "    transition_adder=True)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "G5VUXDBytDu9"
      },
      "source": [
        "In what follows we'll make use of D4PG, an actor-critic learning algorithm. D4PG is a somewhat complicated algorithm, so we'll leave a full explanation of this method to the accompanying paper (see the documentation).\n",
        "\n",
        "However, since D4PG is an actor-critic algorithm we will have to specify a critic for it (a value function). In this case D4PG uses a distributional critic as well. D4PG also makes use of online and target networks so we need to create copies of both the policy_network (from earlier) and the new critic network we are about to create.\n",
        "\n",
        "To build our critic networks, we use a ***multiplexer***, which is simply a neural network module that takes multiple inputs and processes them in different ways before combining them and processing further. In the case of Acme's `CriticMultiplexer`, the inputs are observations and actions, each with their own network torso. There is then a critic network module that processes the outputs of the observation network and the action network and outputs a tensor.\n",
        "\n",
        "Finally, in order to optimize these networks the learner must receive networks with the variables created. We have utilities in Acme to handle exactly this, and we do so in the final lines of the following code block."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "OgTk7IAKXKz0"
      },
      "outputs": [],
      "source": [
        "critic_network = snt.Sequential([\n",
        "    networks.CriticMultiplexer(\n",
        "        observation_network=tf2_utils.batch_concat,\n",
        "        action_network=tf.identity,\n",
        "        critic_network=networks.LayerNormMLP(\n",
        "            layer_sizes=(400, 300),\n",
        "            activate_final=True)),\n",
        "    # Value-head gives a 51-atomed delta distribution over state-action values.\n",
        "    networks.DiscreteValuedHead(vmin=-150., vmax=150., num_atoms=51)])\n",
        "\n",
        "# Create the target networks\n",
        "target_policy_network = copy.deepcopy(policy_network)\n",
        "target_critic_network = copy.deepcopy(critic_network)\n",
        "\n",
        "# We must create the variables in the networks before passing them to learner.\n",
        "tf2_utils.create_variables(network=policy_network,\n",
        "                           input_spec=[environment_spec.observations])\n",
        "tf2_utils.create_variables(network=critic_network,\n",
        "                           input_spec=[environment_spec.observations,\n",
        "                                       environment_spec.actions])\n",
        "tf2_utils.create_variables(network=target_policy_network,\n",
        "                           input_spec=[environment_spec.observations])\n",
        "tf2_utils.create_variables(network=target_critic_network,\n",
        "                           input_spec=[environment_spec.observations,\n",
        "                                       environment_spec.actions])"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ENTZ_cj3GiLr"
      },
      "source": [
        "We can now create a learner that uses these networks. Note that here we're using the same discount factor as was used in the transition adder. The rest of the parameters are reasonable defaults.\n",
        "\n",
        "Note however that we will log output to the terminal at regular intervals. We have also turned off checkpointing of the network weights (i.e. saving them). This is usually used by default but can cause issues with interactive colab sessions."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "5DkJrgkSW94O"
      },
      "outputs": [],
      "source": [
        "learner = d4pg.D4PGLearner(policy_network=policy_network,\n",
        "                           critic_network=critic_network,\n",
        "                           target_policy_network=target_policy_network,\n",
        "                           target_critic_network=target_critic_network,\n",
        "                           dataset=dataset,\n",
        "                           discount=0.99,\n",
        "                           target_update_period=100,\n",
        "                           policy_optimizer=snt.optimizers.Adam(1e-4),\n",
        "                           critic_optimizer=snt.optimizers.Adam(1e-4),\n",
        "                           # Log learner updates to console every 10 seconds.\n",
        "                           logger=loggers.TerminalLogger(time_delta=10.),\n",
        "                           checkpoint=False)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "ANG_L3e1dGoT"
      },
      "source": [
        "Inspecting the learner's public methods we see that it primarily exists to expose its variables and update them. IE this looks remarkably similar to supervised learning."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "Tl3Hu9nHcyp5"
      },
      "outputs": [],
      "source": [
        "[method_or_attr for method_or_attr in dir(learner)  # pylint: disable=expression-not-assigned\n",
        " if not method_or_attr.startswith('_')]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "oG9SglevFrmQ"
      },
      "source": [
        "The learner's `step()` method samples a batch of data from the replay dataset given to it, and performs optimization using the optimizer, logging loss metrics along the way. Note: in order to sample from the replay dataset, there must be at least 1000 elements in the replay buffer (which should already have from the actor's added experiences.)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "estCksXXFi2_"
      },
      "outputs": [],
      "source": [
        "learner.step()"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "dL6U_Wi2HTA2"
      },
      "source": [
        "# Training loop\n",
        "Finally, we can put all of the pieces together and run some training steps in the environment, alternating the actor's experience gathering with learner's learning.\n",
        "\n",
        "This is a simple training loop that runs for `num_training_episodes` episodes where the actor and learner take turns generating and learning from experience respectively:\n",
        "\n",
        "- Actor acts in environment \u0026 adds experience to replay for `min_actor_steps_per_iteration` steps\u003cbr\u003e\n",
        "- Learner samples from replay data and learns from it for `num_learner_steps_per_iteration` steps\u003cbr\u003e\n",
        "\n",
        "Note: Since the learner and actor are sharing a policy network, any learning done on the learner, automatically is transferred to the actor's policy.\n"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "cellView": "both",
        "colab": {},
        "colab_type": "code",
        "id": "zC0PgbeyHSSP"
      },
      "outputs": [],
      "source": [
        "num_training_episodes =  10 # @param {type: \"integer\"}\n",
        "min_actor_steps_before_learning = 1000  # @param {type: \"integer\"}\n",
        "num_actor_steps_per_iteration =   100 # @param {type: \"integer\"}\n",
        "num_learner_steps_per_iteration = 1  # @param {type: \"integer\"}\n",
        "\n",
        "learner_steps_taken = 0\n",
        "actor_steps_taken = 0\n",
        "for episode in range(num_training_episodes):\n",
        "  \n",
        "  timestep = environment.reset()\n",
        "  actor.observe_first(timestep)\n",
        "  episode_return = 0\n",
        "\n",
        "  while not timestep.last():\n",
        "    # Get an action from the agent and step in the environment.\n",
        "    action = actor.select_action(timestep.observation)\n",
        "    next_timestep = environment.step(action)\n",
        "\n",
        "    # Record the transition.\n",
        "    actor.observe(action=action, next_timestep=next_timestep)\n",
        "\n",
        "    # Book-keeping.\n",
        "    episode_return += next_timestep.reward\n",
        "    actor_steps_taken += 1\n",
        "    timestep = next_timestep\n",
        "\n",
        "    # See if we have some learning to do.\n",
        "    if (actor_steps_taken \u003e= min_actor_steps_before_learning and\n",
        "        actor_steps_taken % num_actor_steps_per_iteration == 0):\n",
        "      # Learn.\n",
        "      for learner_step in range(num_learner_steps_per_iteration):\n",
        "        learner.step()\n",
        "      learner_steps_taken += num_learner_steps_per_iteration\n",
        "\n",
        "  # Log quantities.\n",
        "  print('Episode: %d | Return: %f | Learner steps: %d | Actor steps: %d'%(\n",
        "      episode, episode_return, learner_steps_taken, actor_steps_taken))"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "Z5w7y3iYYck7"
      },
      "source": [
        "## Putting it all together: an Acme agent\n",
        "\n",
        "\u003cimg src=\"https://github.com/deepmind/acme/raw/master/docs/diagrams/agent_loop.png\" width=\"500\" /\u003e\n",
        "\n",
        "Now that we've used all of the pieces and seen how they can interact, there's one more way we can put it all together. In the Acme design scheme, an agent is an entity with both a learner and an actor component that will piece together their interactions internally. An agent handles the interchange between actor adding experiences to the replay buffer and learner sampling from it and learning and in turn, sharing its weights back with the actor. \n",
        "\n",
        "Similar to how we used `num_actor_steps_per_iteration` and `num_learner_steps_per_iteration` parameters in our custom training loop above, the agent parameters `min_observations` and `observations_per_step` specify the structure of the agent's training loop. \n",
        "* `min_observations` specifies how many actor steps need to have happened to start learning.\n",
        "*  `observations_per_step` specifies how many actor steps should occur in between each learner step. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "bqFpIHE-aRRg"
      },
      "outputs": [],
      "source": [
        "d4pg_agent = agent.Agent(actor=actor,\n",
        "                         learner=learner,\n",
        "                         min_observations=1000,\n",
        "                         observations_per_step=8.)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "o63pJeJl-Lv_"
      },
      "source": [
        "Of course we could have just used the `agents.D4PG` agent directly which sets\n",
        "all of this up for us. We'll stick with this agent we've just created, but most of the steps outlined in this tutorial can be skipped by just making use of a\n",
        "prebuilt agent and the environment loop."
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "F25PNc1GWV0R"
      },
      "source": [
        "## Training the full agent"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "yRdwvHMQVoBB"
      },
      "source": [
        "To simplify collecting and storing experiences, you can also directly use Acme's `EnvironmentLoop` which runs the environment loop for a specified number of episodes. Each episode is itself a loop which interacts first with the environment to get an observation and then give that observation to the agent in order to retrieve an action. Upon termination of an episode a new episode will be started. If the number of episodes is not given then this will interact with the environment infinitely."
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "evoLYDDagdv3"
      },
      "outputs": [],
      "source": [
        "# This may be necessary if any of the episodes were cancelled above.\n",
        "adder.reset()\n",
        "\n",
        "# We also want to make sure the logger doesn't write to disk because that can\n",
        "# cause issues in colab on occasion.\n",
        "logger = loggers.TerminalLogger(time_delta=10.)"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "5Gu5D2obWTBc"
      },
      "outputs": [],
      "source": [
        "loop = environment_loop.EnvironmentLoop(environment, d4pg_agent, logger=logger)\n",
        "loop.run(num_episodes=50)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "colab_type": "text",
        "id": "EAn543WQfLNw"
      },
      "source": [
        "## Evaluate the D4PG agent\n",
        "\n",
        "We can now evaluate the agent. Note that this will use the noisy behavior policy, and so won't quite be optimal. If we wanted to be absolutely precise we could easily replace this with the noise-free policy. Note that the optimal policy can get about 1000 reward in this environment. D4PG should generally get to that within 50-100 learner steps. We've cut it off at 50 and not dropped the behavior noise just to simplify this tutorial. "
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 0,
      "metadata": {
        "colab": {},
        "colab_type": "code",
        "id": "2__mFiraWND1"
      },
      "outputs": [],
      "source": [
        "# Run the actor in the environment for desired number of steps.\n",
        "frames = []\n",
        "num_steps = 500\n",
        "timestep = environment.reset()\n",
        "\n",
        "for _ in range(num_steps):\n",
        "  frames.append(render(environment))\n",
        "  action = d4pg_agent.select_action(timestep.observation)\n",
        "  timestep = environment.step(action)\n",
        "\n",
        "# Save video of the behaviour.\n",
        "display_video(np.array(frames))"
      ]
    }
  ],
  "metadata": {
    "colab": {
      "collapsed_sections": [
        "VEEj3Qw60y73"
      ],
      "name": "Acme: Tutorial",
      "provenance": []
    },
    "kernelspec": {
      "display_name": "Python 3",
      "name": "python3"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
